Abstract

The method of Maximum (relative) Entropy (ME) is used to translate 
the information contained in the known form of the likelihood into a prior 
distribution for Bayesian inference. The argument is guided by intuition 
gained from the successful use of ME methods in statistical mechanics. 
For experiments that cannot be repeated the resulting "entropic prior" is 
formally identical with the Einstein fluctuation formula. For repeatable 
experiments, however, the expected value of the entropy of the likelihood 
turns out to be relevant information that must be included in the analysis. 
As an example the entropic prior for a Gaussian likelihood is calculated. 



1 Introduction 

Among the methods used to update from a prior probability distribution to a
posterior distribution when new information becomes available there are two
that can claim the distinction of being systematic, objective, and of wide
applicability: one is based on Bayes' theorem (for applications to physics
see [1]) and the other is based on the maximization of (relative) entropy [2].
The choice between the two methods is dictated by the nature of the
information being processed.

Bayes' theorem should be used when we want to update our beliefs about
the values of quantities $\theta$ on the basis of observed values of data $y$
and of the known relation between them, the likelihood $p(y|\theta)$. The
posterior distribution is $p(\theta|y) \propto \pi(\theta)\,p(y|\theta)$. The
previous knowledge about $\theta$ is codified both in the prior distribution
$\pi(\theta)$ and also in the likelihood $p(y|\theta)$.
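
As a minimal illustration of the update rule just stated, the sketch below evaluates posterior ∝ prior × likelihood on a grid. The Gaussian likelihood with unit width, the flat prior, the grid and the datum `y = 1.2` are all assumptions chosen for the example; none of them comes from the text.

```python
import math

def grid_posterior(thetas, prior, likelihood, y):
    """Normalized posterior weights on a grid: posterior ∝ prior(theta) * likelihood(y | theta)."""
    unnorm = [prior(t) * likelihood(y, t) for t in thetas]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def gaussian_likelihood(y, theta, sigma=1.0):
    """Illustrative likelihood p(y | theta): a Gaussian of known width centered at theta."""
    return math.exp(-0.5 * ((y - theta) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical grid and datum, used only to exercise the function.
thetas = [i * 0.01 for i in range(-500, 501)]
posterior = grid_posterior(thetas, lambda t: 1.0, gaussian_likelihood, y=1.2)
theta_map = thetas[posterior.index(max(posterior))]  # grid point with the largest posterior weight
```

With a flat prior the posterior simply follows the likelihood, so `theta_map` lands on the grid point nearest the datum; an informative prior would shift it.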

The selection of the prior is a difficult problem [3] because it is not always
clear how to translate our previous beliefs about $\theta$ into a distribution
$\pi(\theta)$ in an objective way. One approach that seems to work, at least
sometimes, is to rely on experience and physical intuition but this becomes
unreliable in situations of increasing complexity. Attempts to achieve
objectivity include arguments invoking symmetry - generalized forms of the
principle of insufficient reason - and arguments that seek to identify that
state of knowledge that reflects complete ignorance. The latter suggest
connections with the notion of entropy [4] and have led to proposals for
"entropic priors" |S1 E|. This brings us to the second method of processing
information, the method of maximum entropy, which is designed for processing
information given in the form of constraints on the family of posterior
distributions.

In this paper we use entropic arguments to translate information into a
prior distribution. Rather than seeking a totally non-informative prior, we
translate information that we do in fact have: the knowledge of the likelihood
function, $p(y|\theta)$, already constitutes valuable prior information. The
prior thus obtained is an "entropic prior." The bare entropic priors discussed
here apply to a situation where all we know about the quantities $\theta$ is
that they appear as parameters in the likelihood $p(y|\theta)$. It is
straightforward, however, to extend the method and incorporate additional
relevant information beyond that contained in the likelihood.

The first proposal of priors of this form is due to Skilling [5] for the case of
discrete distributions. The second proposal, due to Rodriguez UJ, provided the
generalization to the continuous case and further elaborations |3 E]. In section
2 we give a derivation that is closer in spirit to applications of ME to
statistical mechanics. A difficulty with the case of experiments that can be
indefinitely repeated, which had been identified in [10], is diagnosed and
resolved with the introduction of a hyper-parameter $\alpha$ in section 3. The
analogy to statistical mechanics is important: the interpretation of $\alpha$
as a Lagrange multiplier affects how $\alpha$ should be estimated and is an
important difference between the entropic prior proposed here and those of
Skilling and Rodriguez. The example of a Gaussian likelihood is given in
section 4. In section 5 we collect our conclusions and some final comments.

2 The basic idea 

We use the ME method [2] to derive a prior $\pi(\theta)$ for use in Bayes'
theorem $p(\theta|y) \propto p(y,\theta) = \pi(\theta)\,p(y|\theta)$. As
discussed in [TJ], since Bayes' theorem follows from the product rule we must
focus our attention on $p(y,\theta)$ rather than $\pi(\theta)$. Thus, the
relevant universe of discourse is the product $\Theta \times Y$ of $\Theta$,
the space of all $\theta$s, and the data space $Y$. This important point was
first made by Rodriguez [B] but both our derivation and final results differ
from his [HUH].

To rank distributions on the space $\Theta \times Y$ we must first decide on a
prior $m(y,\theta)$. When nothing is known about the variables $\theta$ - in
particular, no relation between $y$ and $\theta$ is yet known - the prior must
be a product $m(y)\,\mu(\theta)$ of the separate priors in the spaces $Y$ and
$\Theta$ because maximizing the relative entropy yields
$p(y,\theta) \propto m(y)\,\mu(\theta)$. This distribution reflects our state
of ignorance: the data about $y$ tells us absolutely nothing about $\theta$.
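
The relative entropy invoked in this paragraph is not displayed in the text as it stands. A sketch of the step, assuming the entropy has the same form as the one used later in eq.(3) and that normalization is the only constraint (the multiplier $\lambda$ is introduced here purely for illustration):

```latex
\sigma[p] = -\int dy\,d\theta\; p(y,\theta)\,
            \log\frac{p(y,\theta)}{m(y)\,\mu(\theta)} ,
\qquad
\frac{\delta}{\delta p}\Big\{\sigma[p]
   - \lambda\Big(\int dy\,d\theta\; p(y,\theta) - 1\Big)\Big\} = 0
\;\Longrightarrow\;
-\log\frac{p(y,\theta)}{m(y)\,\mu(\theta)} - 1 - \lambda = 0
\;\Longrightarrow\;
p(y,\theta) \propto m(y)\,\mu(\theta) .
```

With no constraint linking $y$ and $\theta$, the maximum simply reproduces the product of the two separate priors.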

In what follows we assume that $m(y)$ is known because it is an important
part of understanding what data it is that has been collected. Furthermore, if
the $\theta$s are parameters labeling some distributions $p(y|\theta)$, then
for each particular choice of the functional form of $p(y|\theta)$ there is a
natural distance in the space $\Theta$ given by the Fisher-Rao metric
$d\ell^2 = g_{ij}\,d\theta^i d\theta^j$ [11],

$$g_{ij} = \int dy\; p(y|\theta)\,
  \frac{\partial \log p(y|\theta)}{\partial\theta^i}\,
  \frac{\partial \log p(y|\theta)}{\partial\theta^j}. \tag{2}$$

Therefore the prior on $\theta$ is $\mu(\theta) = g^{1/2}(\theta)$ where
$g(\theta)$ is the determinant of $g_{ij}$.

Next we incorporate the crucial piece of information: of all joint distributions
$p(y,\theta) = \pi(\theta)\,p(y|\theta)$ we consider the subset where the
likelihood $p(y|\theta)$ has a fixed, known functional form. Notice that this
is an unusual constraint; it is not an expectation value. Note also that the
only information we are using about the quantities $\theta$ is that they appear
as parameters in the known likelihood $p(y|\theta)$, nothing else. But, of
course, should additional relevant information (i.e., an additional constraint)
be known it should also be taken into account.

The preferred distribution $p(y,\theta)$ is chosen by varying $\pi(\theta)$ to
maximize

$$\sigma[\pi] = -\int dy\,d\theta\;\pi(\theta)\,p(y|\theta)\,
  \log\frac{\pi(\theta)\,p(y|\theta)}{g^{1/2}(\theta)\,m(y)}. \tag{3}$$

Assuming that both $\pi(\theta)$ and $p(y|\theta)$ are normalized the result is

$$\pi(\theta)\,d\theta = \frac{1}{\zeta}\,e^{S(\theta)}\,g^{1/2}(\theta)\,d\theta
  \quad\text{where}\quad
  \zeta = \int d\theta\; g^{1/2}(\theta)\,e^{S(\theta)}, \tag{4}$$

and $S(\theta)$ is the entropy of the likelihood,

$$S(\theta) = -\int dy\; p(y|\theta)\,\log\frac{p(y|\theta)}{m(y)}. \tag{5}$$

The entropic prior eq.(4) is our first important result: it gives the
probability that the value of $\theta$ should lie within the small volume
$g^{1/2}(\theta)\,d\theta$. The preferred value of $\theta$ is that which
maximizes the entropy $S(\theta)$ because this maximizes the scalar probability
density $\exp S(\theta)$. Note that eq.(4) is manifestly invariant under
changes of the coordinates $\theta$.
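
The step from the entropy of eq.(3) to the prior of eq.(4) can be sketched as follows. This is a reconstruction under the stated assumption that $p(y|\theta)$ is normalized; $\lambda$ is a Lagrange multiplier enforcing normalization of $\pi(\theta)$ and $\zeta$ denotes the resulting normalization constant.

```latex
\begin{aligned}
\sigma[\pi] &= -\int d\theta\;\pi(\theta)\,\log\frac{\pi(\theta)}{g^{1/2}(\theta)}
              + \int d\theta\;\pi(\theta)\,S(\theta)
  \qquad \Big(\text{using } \int dy\,p(y|\theta)=1 \text{ and eq.(5)}\Big),\\[4pt]
0 &= \frac{\delta}{\delta\pi(\theta)}
     \Big\{\sigma[\pi] - \lambda\Big(\int d\theta\;\pi(\theta) - 1\Big)\Big\}
   = -\log\frac{\pi(\theta)}{g^{1/2}(\theta)} - 1 + S(\theta) - \lambda ,\\[4pt]
\pi(\theta) &= e^{-1-\lambda}\,g^{1/2}(\theta)\,e^{S(\theta)}
             = \frac{1}{\zeta}\,g^{1/2}(\theta)\,e^{S(\theta)} .
\end{aligned}
```

The first line shows why only the entropy $S(\theta)$ of the likelihood survives: the $y$-integration of the $\log p(y|\theta)/m(y)$ term is exactly $-S(\theta)$.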

To summarize: for the special case of a fixed data space $Y$, that is, for
experiments that cannot be repeated, we have succeeded in translating the
information contained in the model - the space $Y$, its measure $m(y)$, and the
conditional distribution $p(y|\theta)$ - into a prior $\pi(\theta)$.

But for experiments that can be repeated indefinitely the prior eq.(4) yields
nonsense and we have a problem. Indeed, let us assume that $\theta$ is not a
"random" variable, its value is fixed but unknown. For $N$ independent
repetitions of an experiment, the joint distribution in the space
$\Theta \times Y^N$ is

$$p(y^{(N)},\theta) = \pi^{(N)}(\theta)\,p(y^{(N)}|\theta)
  = \pi^{(N)}(\theta)\,p(y_1|\theta)\cdots p(y_N|\theta), \tag{6}$$

and maximization of the appropriate entropy gives [10]

$$\pi^{(N)}(\theta) = \frac{1}{\zeta^{(N)}}\,g^{1/2}(\theta)\,e^{N S(\theta)}, \tag{7}$$

which is clearly wrong. The dependence of $\pi^{(N)}$ on the amount $N$ of data
would lead us to a perpetual revision of the prior as more data is collected.
For large $N$ the data becomes irrelevant.
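
The $N$-dependence criticized here can be made concrete with a toy model. The sketch below is an illustration only: it assumes a Bernoulli likelihood with parameter $\theta$ and a uniform measure $m(y) = 1/2$ on the two outcomes (a discrete stand-in for the integrals in the text), for which $S(\theta) = -\theta\log\theta - (1-\theta)\log(1-\theta) - \log 2$ and $g(\theta) = 1/[\theta(1-\theta)]$. The function names are hypothetical.

```python
import math

def entropy_S(theta):
    """Entropy of a Bernoulli likelihood relative to the uniform measure m(y) = 1/2."""
    return -(theta * math.log(theta) + (1 - theta) * math.log(1 - theta)) - math.log(2)

def sqrt_g(theta):
    """Square root of the Fisher information of the Bernoulli model."""
    return 1.0 / math.sqrt(theta * (1 - theta))

def prior_N(N, n_grid=20000):
    """Midpoint-rule discretization of the prior proportional to g^{1/2} exp(N S) on (0, 1)."""
    h = 1.0 / n_grid
    thetas = [(k + 0.5) * h for k in range(n_grid)]
    weights = [sqrt_g(t) * math.exp(N * entropy_S(t)) for t in thetas]
    norm = sum(weights) * h
    return thetas, [w / norm for w in weights], h

def prior_std(N):
    """Standard deviation of theta under the N-dependent prior."""
    thetas, dens, h = prior_N(N)
    mean = sum(t * d for t, d in zip(thetas, dens)) * h
    var = sum((t - mean) ** 2 * d for t, d in zip(thetas, dens)) * h
    return math.sqrt(var)

# The prior narrows around the entropy maximum theta = 1/2 as N grows.
for N in (1, 10, 100):
    print(N, prior_std(N))
```

In this toy model the width shrinks roughly like $1/(2\sqrt{N})$, so the prior alone would fix $\theta$ ever more sharply as more repetitions are planned, before any data are seen.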

The problem, as we will see next, is not a failure of the ME method but a
failure to include all the relevant information. Indeed, when an experiment can
be repeated we actually know more than just
$p(y^{(N)}|\theta) = p(y_1|\theta)\cdots p(y_N|\theta)$. We also know that
discarding the values of, say, $y_2, \ldots, y_N$ yields an experiment that is
indistinguishable from the single, $N = 1$, experiment. This additional
information, which is expressed by
$\int dy_2 \cdots dy_N\, p(y^{(N)},\theta) = p(y_1,\theta)$, leads to
$\pi^{(N)}(\theta) = \pi^{(1)}(\theta)$ for all $N$. Next we identify a
constraint that codifies this information within each space
$\Theta \times Y^N$.



3 More information: the Lagrange multiplier $\alpha$

For large $N$ the prior $\pi^{(N)}(\theta)$ in eq.(7) reflects an overwhelming
preference for the value of $\theta$ that maximizes the entropy $S(\theta)$.
Indeed, as $N \to \infty$ we have

$$\langle S \rangle = \int d\theta\;\pi^{(N)}(\theta)\,S(\theta)
  \;\xrightarrow{\,N\to\infty\,}\; S(\theta_{\max}), \tag{8}$$

which is manifestly incorrect. This suggests that information about the actual
numerical value $\bar{S}$ of the expected entropy $\langle S \rangle$ is very
relevant (because if $\bar{S}$ were known the problem above would not arise)
and that we should maximize $\sigma^{(N)}$ subject to an additional constraint
on $\langle S \rangle$. Naturally, additional steps will be needed to estimate
the unknown $\bar{S}$. A similar argument justifying the introduction of
constraints in statistical physics is explored in [2].
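We maximize the entropy

$$\sigma^{(N)}[\pi] = -\int d\theta\;\pi(\theta)\,
  \log\frac{\pi(\theta)}{g^{1/2}(\theta)\,e^{N S(\theta)}} \tag{9}$$

subject to constraints on $\langle S \rangle$ and that $\pi$ be normalized. (An
unimportant factor of $N^{d/2}$ has been dropped from the Fisher-Rao measure
$[g^{(N)}(\theta)]^{1/2}$.) The result is

$$\pi(\theta) = \frac{1}{\zeta}\,g^{1/2}(\theta)\,
  \exp\left[(N + \lambda_N)\,S(\theta)\right]. \tag{10}$$

The undesired dependence on $N$ is eliminated if the Lagrange multipliers
$\lambda_N$ in each space $\Theta \times Y^N$ are chosen so that
$N + \lambda_N = \alpha$ is a constant independent of $N$. The resulting
entropic prior,

$$\pi(\theta|\alpha) = \frac{1}{\zeta(\alpha)}\,g^{1/2}(\theta)\,e^{\alpha S(\theta)}, \tag{11}$$

is our second important result. The prior $\pi(\theta|\alpha)$ incorporates
information contained in the likelihood plus information about

$$\langle S \rangle = \bar{S}(\alpha) = \frac{d}{d\alpha}\log\zeta(\alpha)
  \quad\text{where}\quad
  \zeta(\alpha) = \int d\theta\; g^{1/2}(\theta)\,e^{\alpha S(\theta)}. \tag{12}$$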
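
The relation between the expected entropy and $\log\zeta(\alpha)$ in eq.(12) can be checked numerically. The sketch below is an illustration under assumptions not made in the text: a Bernoulli likelihood with uniform measure $m(y) = 1/2$ stands in for the model, integrals over $\theta$ are replaced by a midpoint rule, and the derivative is taken by central differences.

```python
import math

def entropy_S(theta):
    """Entropy of a Bernoulli likelihood relative to the uniform measure m(y) = 1/2."""
    return -(theta * math.log(theta) + (1 - theta) * math.log(1 - theta)) - math.log(2)

def sqrt_g(theta):
    """Square root of the Fisher information of the Bernoulli model."""
    return 1.0 / math.sqrt(theta * (1 - theta))

N_GRID = 20000
STEP = 1.0 / N_GRID
THETAS = [(k + 0.5) * STEP for k in range(N_GRID)]  # midpoints of (0, 1)

def log_zeta(alpha):
    """log of zeta(alpha) = integral of g^{1/2} exp(alpha S) over theta."""
    return math.log(sum(sqrt_g(t) * math.exp(alpha * entropy_S(t)) for t in THETAS) * STEP)

def mean_S(alpha):
    """Expected entropy <S> under the prior proportional to g^{1/2} exp(alpha S)."""
    weights = [sqrt_g(t) * math.exp(alpha * entropy_S(t)) for t in THETAS]
    return sum(w * entropy_S(t) for w, t in zip(weights, THETAS)) / sum(weights)

def dlog_zeta(alpha, eps=1e-4):
    """Central-difference estimate of d log zeta / d alpha."""
    return (log_zeta(alpha + eps) - log_zeta(alpha - eps)) / (2 * eps)
```

Comparing `mean_S(alpha)` with `dlog_zeta(alpha)` for a few values of `alpha` shows the two agree to within the finite-difference error, and that the expected entropy rises toward its maximum as `alpha` grows.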

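The last step would be to estimate $\alpha$ and $\theta$ from Bayes' theorem

$$p(\alpha,\theta|y^{(N)}) = \pi(\alpha)\,\pi(\theta|\alpha)\,
  \frac{p(y^{(N)}|\theta)}{p(y^{(N)})}, \tag{13}$$

where $\pi(\alpha)$ is a prior for $\alpha$. However, if we are only interested
in $\theta$, we can just marginalize over $\alpha$ to get

$$p(\theta|y^{(N)}) = \int d\alpha\; p(\alpha,\theta|y^{(N)})
  = \bar{\pi}(\theta)\,\frac{p(y^{(N)}|\theta)}{p(y^{(N)})}, \tag{14}$$

where

$$\bar{\pi}(\theta) = \int d\alpha\;\pi(\alpha)\,\pi(\theta|\alpha). \tag{15}$$

The averaged $\bar{\pi}(\theta)$ is our final expression for the entropic prior.
It is independent of the actual data $y^{(N)}$, as it should be.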

Next we assign an entropic prior to $\alpha$. We start by pointing out that
$\alpha$ is not on the same footing and should not be treated like the other
parameters $\theta$ because the relation between $\alpha$ and the data $y$ is
indirect: $\alpha$ is related to $\theta$ through $\pi(\theta|\alpha)$, and
$\theta$ is related to $y$ through $p(y|\theta)$. Once $\theta$ is given, the
data $y$ contains no further information about $\alpha$. Since the whole
significance of $\alpha$ is derived purely from $\pi(\theta|\alpha)$, eq.(11),
the relevant universe of discourse is $A \times \Theta$ with $\alpha \in A$ and
not $A \times \Theta \times Y^N$ as in [H], which requires the introduction of
an endless chain of hyper-parameters.

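We therefore consider the joint distribution
$\pi(\alpha,\theta) = \pi(\alpha)\,\pi(\theta|\alpha)$ and obtain $\pi(\alpha)$
by maximizing the entropy

$$\sigma[\pi] = -\int d\alpha\,d\theta\;\pi(\alpha,\theta)\,
  \log\frac{\pi(\alpha,\theta)}{\gamma^{1/2}(\alpha)\,g^{1/2}(\theta)}, \tag{16}$$

where $\gamma^{1/2}(\alpha)$ is determined below. Since no reference is made to
repeatable experiments in $Y^N$ there is no need for any further constraints -
and no further hyper-parameters - except for normalization. The result is

$$\pi(\alpha) = \frac{1}{z}\,\gamma^{1/2}(\alpha)\,e^{s(\alpha)}, \tag{17}$$

where, using eqs. (11) and (12), the Fisher-Rao measure $\gamma(\alpha)$ is

$$\gamma(\alpha) = \int d\theta\;\pi(\theta|\alpha)
  \left[\frac{\partial}{\partial\alpha}\log\pi(\theta|\alpha)\right]^2
  = \frac{d^2\log\zeta(\alpha)}{d\alpha^2}, \tag{18}$$

and where $s(\alpha)$ is given by

$$s(\alpha) = -\int d\theta\;\pi(\theta|\alpha)\,
  \log\frac{\pi(\theta|\alpha)}{g^{1/2}(\theta)}
  = \log\zeta(\alpha) - \alpha\,\frac{d\log\zeta(\alpha)}{d\alpha}. \tag{19}$$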

This completes our derivation of the actual prior for $\theta$: the averaged
$\bar{\pi}(\theta)$ in eq.(15) codifies information contained in the likelihood
function, plus the insight that for repeatable experiments, information about
the expected likelihood entropy, even if unavailable, is relevant.
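
The closed forms for the measure $\gamma(\alpha)$ and the entropy $s(\alpha)$ follow directly from eqs. (11) and (12). A short sketch of the intermediate steps, with $\langle\cdot\rangle$ denoting an average over $\pi(\theta|\alpha)$:

```latex
\begin{aligned}
\log\pi(\theta|\alpha) &= \alpha\,S(\theta) - \log\zeta(\alpha) + \tfrac{1}{2}\log g(\theta),\\[2pt]
\frac{\partial}{\partial\alpha}\log\pi(\theta|\alpha)
   &= S(\theta) - \frac{d\log\zeta}{d\alpha} = S(\theta) - \langle S\rangle ,\\[2pt]
\gamma(\alpha) &= \big\langle (S - \langle S\rangle)^2 \big\rangle
   = \frac{\zeta''}{\zeta} - \Big(\frac{\zeta'}{\zeta}\Big)^{2}
   = \frac{d^{2}\log\zeta(\alpha)}{d\alpha^{2}} ,\\[2pt]
s(\alpha) &= -\Big\langle \log\frac{\pi(\theta|\alpha)}{g^{1/2}(\theta)} \Big\rangle
   = -\alpha\,\langle S\rangle + \log\zeta(\alpha)
   = \log\zeta(\alpha) - \alpha\,\frac{d\log\zeta(\alpha)}{d\alpha} .
\end{aligned}
```

In words, $\gamma(\alpha)$ is the variance of the likelihood entropy under $\pi(\theta|\alpha)$, which is why it is non-negative.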



4 Example: a Gaussian model 

Consider data $y^{(N)} = \{y_1, \ldots, y_N\}$ that are scattered around an
unknown value $\mu$,

$$y = \mu + \nu \tag{20}$$

with $\langle\nu\rangle = 0$ and $\langle\nu^2\rangle = \sigma^2$. The goal is
to estimate $\theta = (\mu,\sigma)$ on the basis of $y^{(N)}$ and the
information implicit in the data space $Y$, its measure $m(y)$ (discussed
below), and the Gaussian likelihood,

$$p(y|\mu,\sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}
  \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]. \tag{21}$$



We asserted earlier that knowing the measure $m(y)$ is part of knowing what
data has been collected. In many physical situations where the data happen to
be distributed according to eq.(21) the underlying space $Y$ is invariant under
translations and we can assume $m(y) = m = $ constant. Indeed, the Gaussian
distribution can be obtained by maximizing an entropy with an underlying
constant measure and constraints on the relevant information, the mean $\mu$
and the variance $\sigma^2$.

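From eqs. (5) and (21) the entropy of the likelihood is

$$S(\mu,\sigma) = \log\frac{\sigma}{\sigma_0}
  \quad\text{where}\quad
  \sigma_0 \stackrel{\text{def}}{=} \left(\frac{1}{2\pi e}\right)^{1/2}\frac{1}{m}, \tag{22}$$

and the corresponding Fisher-Rao measure, from eq.(2), is

$$g(\mu,\sigma) = \det\begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}
  = \frac{2}{\sigma^4}. \tag{23}$$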



Note that both $S(\mu,\sigma)$ and $g(\mu,\sigma)$ are independent of $\mu$.
This means that if we were concerned with the simpler problem of estimating
$\mu$ in a situation where $\sigma$ happens to be known, the Bayesian estimate
of $\mu$ using entropic priors coincides with the maximum likelihood estimate.
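
The entropy and Fisher-Rao measure of the Gaussian likelihood can be checked by direct quadrature. The sketch below is an independent numerical check, not part of the original derivation: it evaluates $S = -\int dy\, p \log(p/m)$ and the score-product integrals of the Fisher-Rao metric with a midpoint rule, for a constant measure $m$. The parameter values in the usage lines are arbitrary.

```python
import math

def gaussian_pdf(y, mu, sigma):
    """Gaussian likelihood p(y | mu, sigma)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def expectation(f, mu, sigma, half_width=12.0, n=4000):
    """Midpoint-rule estimate of the integral of f(y) p(y|mu,sigma) over mu +/- half_width*sigma."""
    h = 2 * half_width * sigma / n
    start = mu - half_width * sigma
    total = 0.0
    for k in range(n):
        y = start + (k + 0.5) * h
        total += f(y) * gaussian_pdf(y, mu, sigma)
    return total * h

def likelihood_entropy(mu, sigma, m):
    """S(mu, sigma) = -integral of p log(p/m) for a constant measure m."""
    return -expectation(lambda y: math.log(gaussian_pdf(y, mu, sigma) / m), mu, sigma)

def fisher_metric(mu, sigma):
    """2x2 Fisher-Rao metric in the coordinates (mu, sigma), from products of score functions."""
    d_mu = lambda y: (y - mu) / sigma ** 2
    d_sigma = lambda y: ((y - mu) ** 2 - sigma ** 2) / sigma ** 3
    scores = (d_mu, d_sigma)
    return [[expectation(lambda y, a=a, b=b: a(y) * b(y), mu, sigma) for b in scores]
            for a in scores]

# Arbitrary illustrative values.
mu, sigma, m = 0.3, 1.7, 0.5
S_numeric = likelihood_entropy(mu, sigma, m)
g = fisher_metric(mu, sigma)
det_g = g[0][0] * g[1][1] - g[0][1] * g[1][0]
```

For these values `S_numeric` agrees with $\log\big(\sigma\, m\,\sqrt{2\pi e}\big)$ and `det_g` with $2/\sigma^4$; changing `mu` leaves both unchanged, which is the $\mu$-independence noted above.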

When $\sigma$ is unknown the $\alpha$-dependent entropic prior, eq.(11), is

$$\pi(\mu,\sigma|\alpha) = \frac{2^{1/2}}{\zeta(\alpha)}\,
  \frac{\sigma^{\alpha-2}}{\sigma_0^{\alpha}}. \tag{24}$$



Since $\pi(\mu,\sigma|\alpha)$ is improper in both $\mu$ and $\sigma$ we must
introduce high and low cutoffs for both $\mu$ and $\sigma$. The fact that
without cutoffs the model is not well defined is interpreted as a request for
additional relevant information, namely, the values of the cutoffs.

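We write the range of $\mu$ as $\Delta\mu = \mu_H - \mu_L$ and introduce
dimensionless quantities $\varepsilon_L$ and $\varepsilon_H$; $\sigma$ extends
from $\sigma_L = \sigma_0\,\varepsilon_L$ to
$\sigma_H = \sigma_0/\varepsilon_H$. Then $\zeta(\alpha)$ and
$\pi(\mu,\sigma|\alpha)$ are given by

$$\zeta(\alpha) = \frac{2^{1/2}\,\Delta\mu}{\sigma_0}\,
  \frac{\varepsilon_H^{1-\alpha} - \varepsilon_L^{\alpha-1}}{\alpha-1} \tag{25}$$

and

$$\pi(\mu,\sigma|\alpha) = \frac{1}{\Delta\mu}\,
  \frac{\alpha-1}{\varepsilon_H^{1-\alpha} - \varepsilon_L^{\alpha-1}}\,
  \frac{\sigma^{\alpha-2}}{\sigma_0^{\alpha-1}}. \tag{26}$$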



Note that $\pi(\mu,\sigma|\alpha = 1)\,d\sigma$ reduces to a multiple of
$d\sigma/\sigma$, which is the Jeffreys prior usually introduced by imposing
invariance under scale transformations, $\sigma \to \lambda\sigma$.
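
The cutoff-dependent normalization can be checked against a direct numerical integration of $2^{1/2}\sigma^{\alpha-2}/\sigma_0^{\alpha}$ over the allowed ranges of $\mu$ and $\sigma$. The sketch below assumes the closed form $\zeta(\alpha) = (2^{1/2}\Delta\mu/\sigma_0)\,(\varepsilon_H^{1-\alpha} - \varepsilon_L^{\alpha-1})/(\alpha-1)$ that follows from that integral; the cutoff values used are arbitrary.

```python
import math

SQRT2 = math.sqrt(2.0)

def zeta_closed(alpha, dmu, sigma0, eps_L, eps_H):
    """Closed-form normalization with sigma running from sigma0*eps_L to sigma0/eps_H."""
    return SQRT2 * dmu / sigma0 * (eps_H ** (1 - alpha) - eps_L ** (alpha - 1)) / (alpha - 1)

def zeta_numeric(alpha, dmu, sigma0, eps_L, eps_H, n=20000):
    """Midpoint rule in u = log(sigma); the mu-integration contributes the factor dmu."""
    u_lo, u_hi = math.log(sigma0 * eps_L), math.log(sigma0 / eps_H)
    h = (u_hi - u_lo) / n
    total = sum(math.exp((alpha - 1) * (u_lo + (k + 0.5) * h)) for k in range(n))
    return SQRT2 * dmu * total * h / sigma0 ** alpha

def prior_sigma(sigma, alpha, dmu, sigma0, eps_L, eps_H):
    """Prior density pi(mu, sigma | alpha) with cutoffs."""
    return SQRT2 * sigma ** (alpha - 2) / sigma0 ** alpha / zeta_closed(alpha, dmu, sigma0, eps_L, eps_H)

# Arbitrary illustrative cutoffs and scales.
dmu, sigma0, eps_L, eps_H = 3.0, 0.8, 0.1, 0.05
log_range = math.log(1.0 / (eps_L * eps_H))  # equals log(sigma_H / sigma_L)
# Near alpha = 1 the density times sigma is flat: the Jeffreys form d(sigma)/sigma.
jeffreys_check = prior_sigma(2.0, 1.0 + 1e-6, dmu, sigma0, eps_L, eps_H) * 2.0 * dmu * log_range
```

`jeffreys_check` comes out close to one, reflecting that at $\alpha = 1$ the density is $1/[\Delta\mu\,\sigma\,\log(\sigma_H/\sigma_L)]$.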



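Writing $\varepsilon = (\varepsilon_L\varepsilon_H)^{1/2}$, the prior for
$\alpha$ is obtained from eq.(18),

$$\gamma(\alpha) = \frac{1}{(\alpha-1)^2}
  - \left[\frac{2\log\varepsilon}{\varepsilon^{1-\alpha} - \varepsilon^{\alpha-1}}\right]^2, \tag{27}$$

and from eqs. (17) and (19),

$$\pi(\alpha) = \frac{1}{z}\,\gamma^{1/2}(\alpha)\,
  \frac{\varepsilon^{1-\alpha} - \varepsilon^{\alpha-1}}{\alpha-1}\,
  \exp\left[\frac{1}{\alpha-1}
  + \alpha\log\varepsilon\;
    \frac{\varepsilon^{1-\alpha} + \varepsilon^{\alpha-1}}
         {\varepsilon^{1-\alpha} - \varepsilon^{\alpha-1}}\right], \tag{28}$$

where the normalization $z$ has been suitably redefined.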

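Eqs. (27) and (28) simplify in the limit $\varepsilon \to 0$. Note that the
same result is obtained irrespective of the order in which we let
$\varepsilon_L \to 0$ and/or $\varepsilon_H \to 0$. The resulting
$\gamma(\alpha)$ and $\pi(\alpha)$ are

$$\gamma(\alpha) = \frac{1}{(\alpha-1)^2} \tag{29}$$

and

$$\pi(\alpha) = \begin{cases}
  \dfrac{1}{(1-\alpha)^2}\exp\left[-\dfrac{1}{1-\alpha}\right] & \text{for } \alpha < 1 \\[8pt]
  0 & \text{for } \alpha > 1
\end{cases} \tag{30}$$

where $\pi(\alpha)$ is normalized and is shown in Fig. 1.

$\pi(\alpha)$ reaches its maximum value at $\alpha = 1/2$. Since
$\pi(\alpha) \sim \alpha^{-2}$ for $\alpha \to -\infty$ the expected value of
$\alpha$ and all higher moments diverge. This suggests that replacing the
unknown $\alpha$ in the prior $\pi(\theta|\alpha)$ by any given numerical value
is probably not a good approximation.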
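
The stated properties of the limiting prior for $\alpha$ (normalization, maximum at $\alpha = 1/2$, slowly decaying tail) are easy to check numerically. The sketch below assumes the limiting form $\pi(\alpha) = (1-\alpha)^{-2}\exp[-1/(1-\alpha)]$ for $\alpha < 1$ and zero otherwise; the integration range and grid are arbitrary choices.

```python
import math

def prior_alpha(alpha):
    """Limiting prior: (1 - alpha)^-2 exp(-1/(1 - alpha)) for alpha < 1, zero otherwise."""
    if alpha >= 1.0:
        return 0.0
    t = 1.0 - alpha
    return math.exp(-1.0 / t) / t ** 2

def mass_between(a, b, n=200000):
    """Midpoint-rule estimate of the prior mass on the interval [a, b]."""
    h = (b - a) / n
    return sum(prior_alpha(a + (k + 0.5) * h) for k in range(n)) * h

# Grid search for the mode on [-2, 1) with spacing 0.001.
mode = max((k * 0.001 for k in range(-2000, 1000)), key=prior_alpha)

# Mass on [-999, 1); the substitution u = 1/(1 - alpha) gives exp(-1/1000) exactly.
mass = mass_between(-999.0, 1.0)
```

The mass missing below $\alpha = -A$ is $1 - \exp[-1/(1+A)] \approx 1/(1+A)$, which decays so slowly that $\int \alpha\,\pi(\alpha)\,d\alpha$ diverges logarithmically, in line with the remark on the moments.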

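Since $\alpha$ is unknown the effective prior for $\theta = (\mu,\sigma)$ is
obtained by marginalizing
$\pi(\mu,\sigma,\alpha) = \pi(\mu,\sigma|\alpha)\,\pi(\alpha)$ over $\alpha$,
eq.(15). Since $\pi(\alpha) = 0$ for $\alpha > 1$ as $\varepsilon \to 0$ we can
safely take the limit $\varepsilon_H \to 0$ or $\sigma_H \to \infty$ while
keeping $\sigma_L$ fixed,

$$\pi(\mu,\sigma,\alpha) = \begin{cases}
  \dfrac{1}{\Delta\mu\,\sigma_L}\,\dfrac{1}{1-\alpha}
  \left(\dfrac{\sigma}{\sigma_L}\right)^{\alpha-2}
  \exp\left[-\dfrac{1}{1-\alpha}\right] & \text{for } \alpha < 1 \\[8pt]
  0 & \text{for } \alpha > 1.
\end{cases} \tag{31}$$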
Figure 1: The prior $\pi(\alpha)$ for various values of the cutoff parameter
$\varepsilon$, as $\varepsilon \to 0$.



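Marginalizing over $\alpha$ gives

$$\bar{\pi}(\mu,\sigma) = \int_{-\infty}^{1} d\alpha\;
  \frac{1}{\Delta\mu\,\sigma_L}\,\frac{1}{1-\alpha}
  \left(\frac{\sigma}{\sigma_L}\right)^{\alpha-2}
  \exp\left[\frac{1}{\alpha-1}\right]
  = \frac{2}{\Delta\mu\,\sigma}\,
    K_0\!\left(2\sqrt{\log\frac{\sigma}{\sigma_L}}\right), \tag{32}$$

where $K_0$ is a modified Bessel function of the second kind. This is the
entropic prior for the Gaussian model. The function

$$P(x) = \frac{2}{x}\,K_0\!\left(2\sqrt{\log x}\right) \tag{33}$$

is shown in Fig. 2 as a function of $x = \sigma/\sigma_L$. The singularity as
$x \to 1$ is integrable.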
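
The Bessel-function form of the marginalized prior can be checked numerically without special-function libraries. The sketch below is an independent check under stated assumptions: it uses the standard integral representation $K_0(z) = \int_0^\infty e^{-z\cosh s}\,ds$, and it evaluates the $\alpha$-integral $\int_{-\infty}^{1} d\alpha\,(1-\alpha)^{-1} x^{\alpha-2} e^{-1/(1-\alpha)}$ with the substitution $1-\alpha = e^{u}$. Both use a plain trapezoid rule; the truncation limits are arbitrary but generous.

```python
import math

def bessel_k0(z, s_max=12.0, n=6000):
    """K_0(z) from K_0(z) = integral over s >= 0 of exp(-z cosh s), by the trapezoid rule."""
    h = s_max / n
    total = 0.5 * math.exp(-z)  # s = 0 endpoint carries half weight
    for k in range(1, n + 1):
        # The integrand is numerically zero long before s_max, so the far endpoint needs no special weight.
        total += math.exp(-z * math.cosh(k * h))
    return total * h

def alpha_integral(x, u_max=12.0, n=12000):
    """Integral over alpha < 1 of (1-alpha)^-1 x^(alpha-2) exp(-1/(1-alpha)), using 1 - alpha = e^u."""
    b = math.log(x)
    h = 2 * u_max / n
    total = 0.0
    for k in range(n + 1):
        u = -u_max + k * h
        # x^(alpha-2) = exp(-(1 + e^u) log x) and exp(-1/(1-alpha)) = exp(-e^-u); d(alpha)/(1-alpha) -> du
        total += math.exp(-(1.0 + math.exp(u)) * b - math.exp(-u))
    return total * h

def P(x):
    """P(x) = (2/x) K_0(2 sqrt(log x)) for x > 1."""
    return 2.0 / x * bessel_k0(2.0 * math.sqrt(math.log(x)))
```

Evaluating `alpha_integral(x)` and `P(x)` at a few points with `x > 1` shows the two agree to high precision; multiplying either by $1/(\Delta\mu\,\sigma_L)$ gives the marginalized prior.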



5 Final remarks 

Using the method of maximum relative entropy we have translated the infor- 
mation contained in the known form of the likelihood into a prior distribution. 
The argument follows closely the analogous application of the ME method to 
statistical mechanics. For experiments that cannot be repeated the resulting 
"entropic prior" is formally identical with the Einstein fluctuation formula. For 
repeatable experiments, however, additional relevant information - represented
in terms of a Lagrange multiplier $\alpha$ - must be included in the analysis. The
important case of a Gaussian likelihood was treated in detail. 

We have dealt with the simplest case where all we know about the quantities
$\theta$ is that they appear as parameters in the likelihood $p(y|\theta)$. Our
argument can, however, be generalized to situations where we know of additional
relevant information beyond what is contained in the likelihood. Such
information can be taken into account through additional constraints in the
maximization of the entropy $\sigma$.

Figure 2: The effective $\bar{\pi}(\mu,\sigma)$ is shown as
$P(x) = \frac{2}{x}K_0\!\left(2\sqrt{\log x}\right)$ where
$x = \sigma/\sigma_L$.

To conclude we comment briefly on the entropic priors proposed by Skilling
and by Rodriguez. Skilling's prior, unlike ours, is not restricted to
probability distributions but is intended for generic "positive additive
distributions" [5]. Our argument, which consists in maximizing the entropy
$\sigma$ subject to a constraint $p(y,\theta) = \pi(\theta)\,p(y|\theta)$,
makes no sense for generic positive additive distributions for which there is
no available product rule. Another important difference arises from the fact
that Skilling's entropy is not, in general, dimensionless and his
hyper-parameter $\alpha$ is interpreted as some sort of cutoff carrying the
appropriate corrective units. Difficulties with Skilling's prior were
identified in [12].

Rodriguez's approach is, like ours, derived from a maximum entropy principle
[§]. One (minor) difference is his treatment of the underlying measure $m(y)$.
For us knowing $m(y)$ is part of knowing what data has been collected; for him
$m(y)$ is an initial guess and he suggests setting $m(y) = p(y|\theta_0)$ for
some value $\theta_0$. The more important difference, however, is that the
number of observed data $N$ is left unspecified. The space $\Theta \times Y^N$
over which distributions are defined, and therefore the distributions
themselves, also remain unspecified. It is not clear what the maximization of
an entropy over such unspecified spaces could possibly mean but a
hyper-parameter $\alpha$ is eventually introduced and it is interpreted as a
"virtual number of observations supporting the initial guess $\theta_0$." A
different interpretation is given in [?]. Since $\alpha$ is treated on the same
footing as the other parameters $\theta$, Rodriguez's approach requires an
endless chain of hyper-parameters.